Many risk factors and interventions in epidemiologic and biomedical studies have minuscule effects. To detect
such weak associations, one needs a study with a very large sample size (the number of subjects, n). The n of a
study can be increased, but unfortunately only to an extent. Here, we propose a novel method which hinges
on increasing sample size in a different direction: the total number of variables (p). We construct a p-based
'multiple perturbation test', and conduct power calculations and computer simulations to show that it can
achieve a very high power to detect weak associations when p can be made very large. As a demonstration, we
apply the method to analyze a genome-wide association study on age-related macular degeneration and
identify two novel genetic variants that are significantly associated with the disease. The p-based method
may set a stage for a new paradigm of statistical tests.

Many risk factors and interventions in epidemiologic and biomedical studies have minuscule effects [1]. For
example, television viewing was found to increase the risks of type 2 diabetes, cardiovascular disease
and all-cause mortality, but the effects in terms of relative risks are small: 1.20, 1.15 and 1.13,
respectively [2]; regular supplementation of vitamin C was associated with a shortening of the duration of
common colds, but with a relative risk (0.92) very near unity [3]. Moving into this '-omics' era, researchers
are for the first time becoming able to probe into study subjects' genome, transcriptome, metabolome, etc., to
search for possible disease associations. However, the associations found so far are still very weak; for
example, the great majority of the odds ratios of genetic polymorphisms in genome-wide association studies
were less than 1.5 [4,5].

To detect weak associations, a very large sample size is needed. For example, in genome-wide association
studies, the sample sizes have steeply increased from a few hundred in the first study of age-related macular
degeneration [6] to tens of thousands in recent meta-analyses [7,8]. Also, consortium-based studies are becoming
increasingly indispensable, as single-institution studies often cannot meet the tough sample-size requirements.
For example, the Wellcome Trust Case-Control Consortium [9], the United Kingdom Biobank [10] and the China
Kadoorie Biobank [11] have recruited study subjects on the order of hundreds of thousands. But how big is big
enough for sample size? A simulation study suggested that in some scenarios the sample size needed can easily go
up to the millions [12]! Certainly, there is a limit to the total number of subjects any research institution, any
meta-analysis and any consortium can possibly assemble.

Traditionally, sample sizes are measured in terms of the total number of study subjects (n). In this study, we
propose a novel 'p-based' method which hinges on increasing sample size in a different direction: the total
number of variables (p). We construct a p-based 'multiple perturbation test', and conduct theoretical power
calculations and computer simulations to show that it can achieve a very high power to detect a weak association
when p can be made very large, say, to the thousands, millions or even more. We also apply the new method to
re-analyze a published genome-wide association study.

Results 

Sharp null. Assume that we are interested in the association between a binary factor, X (X = 1: exposed; X = 0:
unexposed) and a disease, D (D = 1: diseased; D = 0: non-diseased). Consider also a binary auxiliary variable, Z
(Z = 1 or 0), which is not of direct interest to us, but may help discern the possible association between X and
D. Our method is based on testing whether the disease risk varies with X in any segment of the population
demarcated by Z, i.e., testing the 'sharp null',

$$H_0^{\text{sharp}}: \Pr(D \mid X, Z) = \Pr(D \mid Z) \tag{1}$$

for both Z = 1 and Z = 0, against the alternative,

$$H_1^{\text{sharp}}: \Pr(D \mid X, Z) \neq \Pr(D \mid Z) \tag{2}$$

for either Z = 1 or Z = 0.

Figure 1 | Powers of MPT for the sharp null (solid lines, theoretical power assuming independent auxiliary
variables with perturbation proportion of, from left to right respectively, $\pi$ = 1.0, 0.2, 0.1 and 0.05) and the
conventional test for the crude null (dashed line), under different number of subjects (a: n = 500, b: n = 1,000,
c: n = 5,000) and number of auxiliary variables. The power of the n-based $\chi^2_{\text{crude}}$ increases with n. The
power gain is only 30%, from 8% (n = 500, a) to 38% (n = 5,000, c). The power of the p-based MPT increases with p
in all scenarios that we considered and surpasses the power of $\chi^2_{\text{crude}}$ when p = 3,000 for $\pi$ = 1,
p ≈ 60,000 for $\pi$ = 0.2, p ≈ 250,000 for $\pi$ = 0.1 and p = 1,000,000 for $\pi$ = 0.05. Under $\pi$ = 1, the power of
MPT can reach nearly 100% when p is sufficiently large (p > ~1,000,000 when n = 500; p > ~100,000 when n = 1,000;
p > ~10,000 when n = 5,000). Under $\pi$ < 1, ~100% power is also possible if p can be made even larger.

In a case-control study conducted in the study population, the Online Methods section shows that testing the
sharp null amounts to testing the equality of odds ratios of X and Z between the case group
($\mathrm{OR}_{XZ}^{\text{case}}$) and the control group ($\mathrm{OR}_{XZ}^{\text{control}}$), or equivalently,
testing whether there is an 'interaction' between X and Z with regard to the risk of D on a multiplicative scale:

$$\mathrm{OR}_{XZ}^{\text{case}} \big/ \mathrm{OR}_{XZ}^{\text{control}} = 1. \tag{3}$$



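The following test statistic is proposed (see Supplementary Table S1 for the cell counts):

$$
\chi^2_{\text{sharp}}
= \frac{\left(\log \widehat{\mathrm{OR}}_{XZ}^{\text{case}} - \log \widehat{\mathrm{OR}}_{XZ}^{\text{control}}\right)^2}
       {\widehat{\mathrm{Var}}\left(\log \widehat{\mathrm{OR}}_{XZ}^{\text{case}}\right) + \widehat{\mathrm{Var}}\left(\log \widehat{\mathrm{OR}}_{XZ}^{\text{control}}\right)}
= \frac{\left(\log \dfrac{n_{11}^{\text{case}}\, n_{00}^{\text{case}}}{n_{10}^{\text{case}}\, n_{01}^{\text{case}}}
             - \log \dfrac{n_{11}^{\text{control}}\, n_{00}^{\text{control}}}{n_{10}^{\text{control}}\, n_{01}^{\text{control}}}\right)^2}
       {\sum_{j,k \in \{0,1\}} \dfrac{1}{n_{jk}^{\text{case}}} + \sum_{j,k \in \{0,1\}} \dfrac{1}{n_{jk}^{\text{control}}}},
\tag{4}
$$

where j and k indicate the statuses of X and Z, respectively, and $n_{jk}^{\text{case}}$ and
$n_{jk}^{\text{control}}$ denote the numbers of case and control subjects with (X = j, Z = k), respectively.
$\chi^2_{\text{sharp}}$ is distributed asymptotically as a df = 1 chi-squared distribution under the sharp null.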

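Essentially, $\chi^2_{\text{sharp}}$ is testing whether the observed $\widehat{\mathrm{OR}}_{XZ}^{\text{case}}$ and
$\widehat{\mathrm{OR}}_{XZ}^{\text{control}}$ are being 'perturbed' further away from
$\mathrm{OR}_{XZ}^{\text{population}}$ (the population odds ratio of X and Z, and the expected value for both
$\mathrm{OR}_{XZ}^{\text{case}}$ and $\mathrm{OR}_{XZ}^{\text{control}}$ under the sharp null) than chance alone
would dictate. We therefore refer to it as a 'perturbation test'.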
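
As an illustration, the perturbation statistic can be computed from the two 2x2 tables of X by Z (one for cases, one for controls). The sketch below is a minimal example, not the authors' implementation: the function names and the cell counts are hypothetical, and the statistic is taken in its usual Woolf-type form (squared difference of the two log odds ratios divided by the sum of reciprocal cell counts). A table with an empty cell is treated as uninformative and given a statistic of zero, following the convention described in the Methods.

```python
import math

def log_odds_ratio(n):
    """Log odds ratio and its estimated variance for a 2x2 table n[j][k] (X = j, Z = k)."""
    log_or = math.log(n[1][1] * n[0][0] / (n[1][0] * n[0][1]))
    var = sum(1.0 / n[j][k] for j in (0, 1) for k in (0, 1))
    return log_or, var

def chi2_sharp(n_case, n_control):
    """Perturbation statistic: squared difference of log odds ratios over its variance."""
    cells = [n[j][k] for n in (n_case, n_control) for j in (0, 1) for k in (0, 1)]
    if min(cells) == 0:
        return 0.0  # empty cell: the auxiliary variable is treated as uninformative
    lor_case, var_case = log_odds_ratio(n_case)
    lor_ctrl, var_ctrl = log_odds_ratio(n_control)
    return (lor_case - lor_ctrl) ** 2 / (var_case + var_ctrl)

# Hypothetical cell counts, indexed as table[X][Z].
cases = [[30, 20], [20, 30]]      # odds ratio of X and Z among cases: 2.25
controls = [[25, 25], [25, 25]]   # odds ratio of X and Z among controls: 1.00
stat = chi2_sharp(cases, controls)
```

The resulting value would be referred to a df = 1 chi-squared distribution under the sharp null.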



Multiple perturbation test. One single auxiliary variable may not perturb the above odds ratios very much. But
if one has a whole panel of auxiliary variables (the $Z_i$ and the corresponding $\chi^2_{\text{sharp},i}$ for
i = 1, 2, ..., p), one can construct a very powerful multiple perturbation test (MPT) by summing up the
perturbations from the many auxiliary variables (Zs) in the panel:

$$\mathrm{MPT} = \sum_{i=1}^{p} \chi^2_{\text{sharp},i}. \tag{5}$$

MPT as such is a p-based test. Its power to detect a non-null X should increase as more Zs are included in the
panel (as p increases). On the other hand, a truly innocent X should be able to stand the test from
multiple Zs, even if p goes to infinity.
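
A minimal sketch of the summation, assuming subject-level binary data stored as plain Python lists: `x` and `d` hold the factor and disease status of each subject and `panel` is a list of auxiliary variables. All names and the randomly generated toy data are hypothetical and serve only to show the mechanics.

```python
import math
import random

def chi2_sharp(x, z, d):
    """Perturbation statistic of one auxiliary variable z for factor x and disease d."""
    n = {(s, j, k): 0 for s in (0, 1) for j in (0, 1) for k in (0, 1)}
    for xi, zi, di in zip(x, z, d):
        n[(di, xi, zi)] += 1          # key: (disease status, X, Z)
    if min(n.values()) == 0:
        return 0.0                    # empty cell: treated as uninformative
    def log_or(s):
        return math.log(n[(s, 1, 1)] * n[(s, 0, 0)] / (n[(s, 1, 0)] * n[(s, 0, 1)]))
    return (log_or(1) - log_or(0)) ** 2 / sum(1.0 / c for c in n.values())

def mpt(x, panel, d):
    """Multiple perturbation test statistic: sum of the p single-variable statistics."""
    return sum(chi2_sharp(x, z, d) for z in panel)

# Toy data: 200 subjects and a panel of p = 5 auxiliary variables (purely illustrative).
rng = random.Random(0)
n_subjects = 200
x = [int(rng.random() < 0.4) for _ in range(n_subjects)]
d = [int(rng.random() < 0.5) for _ in range(n_subjects)]
panel = [[int(rng.random() < 0.4) for _ in range(n_subjects)] for _ in range(5)]
total = mpt(x, panel, d)
```

Because the statistic is a plain sum, enlarging the panel only requires appending further variables to `panel`.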

Figure 1 compares the theoretical powers of MPT and $\chi^2_{\text{crude}}$ (the conventional n-based test for the
'crude null'). For $\chi^2_{\text{crude}}$, we need a very large study (n ≈ 15,000) to attain an adequate power of
80%. On the other hand, the power of MPT increases with p, surpasses that of $\chi^2_{\text{crude}}$, and then can
reach ~100% if p is sufficiently large. Supplementary Figure S1 shows that to make up for the power loss
in using dependent Zs, one can simply include more Zs in the panel. Supplementary Table S2 shows that MPT can
maintain accurate type I error rates for all scenarios considered.

The proposed MPT is applied to a public-domain data set from a genome-wide association study of age-related
macular degeneration [6]. Based on the data of chromosome 1 (a total of 6639 single nucleotide polymorphisms
(SNPs); p (the number of auxiliary variables) = 6638 for each SNP), the method detects two significant SNPs at a
false discovery rate (FDR) [13] of 0.05: rs2618034 (q-value = 0.026) and rs2014029 (q-value = 0.045) (Table 1).
These two SNPs clearly stand out in the Manhattan plot (Supplementary Fig. S2). When we deliberately reduce the
number of auxiliary variables (p = 3000, randomly selected from the 6639 SNPs), the two SNPs remain at the top,
though without reaching significance (Supplementary Fig. S3). When, on the other hand, we expand the number of
auxiliary variables (p > 6638, randomly selected from chromosome 2 to chromosome 22), the two SNPs are still
significant (Supplementary Table S3).

Table 1 | Top five SNPs on chromosome 1 with smallest P-values by MPT for age-related macular degeneration data.
The P-value for each SNP is obtained from 500,000 rounds of permutation. To adjust for multiple testing, FDR is
controlled at 0.05 and the q-values are calculated (QVALUE software) [13].

Rank  RefSNP (rs) number  Minor allele frequency (%)  P-value of MPT  q-value  Odds ratio  P-value of Pearson chi-square test
1     rs2618034           7.19                        4.00 × 10^-6    0.026    0.53        0.201
2     rs2014029           5.82                        1.40 × 10^-5    0.045    2.10        0.166
3     rs437749            43.15                       2.66 × 10^-4    0.357    0.94        0.865
4     rs3753298           5.82                        2.74 × 10^-4    0.357    1.84        0.241
5     rs1749409           8.97                        4.28 × 10^-4    0.357    0.51        0.147



Figure 2 shows the fixation and drifting of P-values of the MPT. Although the 3rd top SNP (rs437749) is not
significant by our FDR standard (Table 1), it already displays a fixation pattern in our fixation/drifting
analysis (Fig. 2c). This suggests that if we can incorporate more perturbation SNPs into the MPT, SNP rs437749
may become significant. We deliberately remove the respective five largest $\chi^2_{\text{sharp},i}$'s in the MPTs
for the two significant SNPs. Even so, a clear fixation pattern can still be seen for both (Supplementary Fig. S4).

We also test run the proposed MPT on chromosome 19 (see 
Supplementary Note). Again, MPT proves to be very powerful. 
With FDR controlled at 0.05, it detects two significant SNPs 
(rs862703 and rs302437) (Supplementary Table S4) which also show 
fixations of P-values (Supplementary Fig. S5) and significantly stand 
out in the Manhattan plot (Supplementary Fig. S6). 

Discussion 

When confronted with high-throughput data, researchers often turn to dimension reduction methods to ease the
severe penalty associated with testing myriads of variables [14-18]. For our p-based method, dimensionality is
not a curse but in fact a blessing. We see that the power of the MPT actually increases as the number of
auxiliary variables increases. Such a 'the-more-the-better' principle also applies when one is knowledgeable
about which variables may be perturbative. In Figure 3, the initial power is only 0.59; should researchers add
more variables into the test? We see, as expected, that adding more variables unselectively into the test at
first dilutes the power. However, as more and more low-informativity variables are added, the power rises again
and eventually surpasses the original power.

However, the p-based approach works only when the auxiliary variables have a non-zero informativeness (I > 0,
irrespective of how small it may be). A computer can easily generate millions and billions of random variables
for us, but all these artificial data amount to nothing (I = 0, exactly). The more such variables are added,
the more the power will be curtailed. Another caveat is that there is no use replicating the data at hand just
to make the total number of auxiliary variables appear larger; the power simply won't budge with this maneuver.

Figure 2 | Fixation ((a-c), respectively for the 1st to the 3rd top SNPs on chromosome 1) and drifting ((d-f),
for three purposefully chosen middle-to-bottom ranking SNPs on chromosome 1) of the P-values of MPT when only a
certain number of perturbation SNPs are randomly incorporated for the age-related macular degeneration data.
Each panel includes three lines (solid, dashed and dotted) representing three random incorporation sequences.
Each P-value is obtained from 1,000,000 rounds of permutation. The P-values initially fluctuate a lot when the
number of perturbation SNPs incorporated is small. But beyond a certain point, the P-values become 'fixed'
exactly to the abscissa (P-values = 0) (a and b), or almost so (P-values ≈ 0) (c). By comparison, the P-values of
all three purposefully chosen middle-to-bottom ranking SNPs are 'drifting' all the way without showing any sign
of a fixation (d-f).

Figure 3 | Power curve when a researcher includes the 100 informative variables (I = 0.02) known to him/her and
then other low-informativity variables (dotted lines from left to right, for I = 0.001, 0.00025 and 0.0001,
respectively) unselectively into MPT.

Age-related macular degeneration is a progressive disease of the macula of the retina in which the pigment
epithelium cells and the photoreceptor cells degenerate, causing gradual loss of central vision [19,20]. With
FDR controlled at 0.05, in this study we are able to identify two novel SNPs on chromosome 1 that are
significantly associated with the disease. The first SNP (rs2618034) is located in the intron region of the
KCND3 gene (potassium voltage-gated channel, Shal-related subfamily, member 3) on chromosome 1p13.2, and the
second (rs2014029), in the intron region of the DTL gene (denticleless E3 ubiquitin protein ligase homolog
(Drosophila)) on 1q32.3. The KCND3 gene encodes Kv4.3, which regulates neuronal excitability [21]. Mutations in
the KCND3 gene have been identified as a cause of cerebellar neurodegeneration [22,23]. In this regard, it is
worth noting that the retinal photoreceptor cells are a specialized type of neuron which may also degenerate
with aging. Meanwhile, the DTL gene regulates p53 polyubiquitination and protein stability [24], and the
evidence to date suggests that p53 is a key regulator involved in the apoptosis of retinal pigment epithelium
cells [25]. All these findings further support that the KCND3 and DTL genes may be causally related to the
development of age-related macular degeneration. (As regards the two significant SNPs found on chromosome 19,
their associations with age-related macular degeneration are also biologically plausible; see Supplementary
Note.)

The multiple perturbation test is indeed a very powerful test. The two significant SNPs on chromosome 1
(rs2618034 and rs2014029) that we identified in this study are only very weakly associated with age-related
macular degeneration (marginal association odds ratios = 0.53 and 2.10, respectively), and the traditional
n-based method (Pearson chi-square test) comes nowhere near detecting them (P-values = 0.201 and 0.166,
respectively) (Table 1). Even if we increase the total number of subjects from the present n = 146 (Klein et
al.'s data [6]) to n ≈ 25,000 and n ≈ 77,000 (Holliday et al.'s [7] and Fritsche et al.'s [8] meta-analysis
data), the n-based method still cannot detect them. But this is not to say that the n-based method is useless.
In fact, Klein et al. [6] themselves presented one SNP (rs380390) with an n-based P-value of
$4.1 \times 10^{-8}$ (significant after Bonferroni correction), but it is undetectable with our method. The
p-based MPT is good at detecting interactive associations, i.e., associations that are prone to be perturbed by
other factors, regardless of how weak the perturbations/interactions may be, whereas the n-based traditional
test is good at detecting marginal associations. It is important that the two different approaches can work
side by side, complementing each other.

The proposed method should have broad applications to other high-dimension (large p) '-omics' studies, such as
epigenomic, transcriptomic, proteomic, metabolomic, and exposomic studies. It would be even better to have a
cross-omics study, and/or one with all its study subjects further linked to existing government or
private-sector databases, such as data on health insurance, traffic violations, internet usage, etc. A
researcher conducting such a data-mining study has the potential to push p (the number of
auxiliary/perturbation variables) to the millions, billions or even trillions, and be rewarded with a very high
power for detecting a weak association. Such a p-based method may set a stage for a new paradigm of
statistical tests.



Methods 

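Crude null and sharp null in a case-control study. Let R = 1 indicate that a subject is recruited in a study,
and R = 0 otherwise. In a case-control study, the recruitment process depends only on the disease status of a
subject, that is,

$$\Pr(R=1 \mid Z, X, D) = \Pr(R=1 \mid X, D) = \Pr(R=1 \mid D). \tag{6}$$

Under the crude null of

$$\Pr(D \mid X) = \Pr(D), \tag{7}$$

we have

$$
\begin{aligned}
\Pr(X \mid D, R=1)
&= \frac{\Pr(X, D, R=1)}{\Pr(D, R=1)} \\
&= \frac{\Pr(X) \times \Pr(D \mid X) \times \Pr(R=1 \mid X, D)}{\Pr(D) \times \Pr(R=1 \mid D)} \\
&= \frac{\Pr(X) \times \Pr(D) \times \Pr(R=1 \mid D)}{\Pr(D) \times \Pr(R=1 \mid D)} \\
&= \Pr(X),
\end{aligned}
\tag{8}
$$

and therefore,

$$
\begin{aligned}
\mathrm{Odds}_X^{\text{case}}
&= \frac{\Pr(X=1 \mid D=1, R=1)}{\Pr(X=0 \mid D=1, R=1)} \\
&= \frac{\Pr(X=1)}{\Pr(X=0)} \\
&= \mathrm{Odds}_X^{\text{population}} \\
&= \frac{\Pr(X=1 \mid D=0, R=1)}{\Pr(X=0 \mid D=0, R=1)} \\
&= \mathrm{Odds}_X^{\text{control}}.
\end{aligned}
\tag{9}
$$

Under the sharp null of

$$\Pr(D \mid Z, X) = \Pr(D \mid Z), \tag{10}$$

we have

$$
\begin{aligned}
\Pr(Z \mid X, D, R=1)
&= \frac{\Pr(X, Z, D, R=1)}{\Pr(X, D, R=1)} \\
&= \frac{\Pr(X) \times \Pr(Z \mid X) \times \Pr(D \mid Z, X) \times \Pr(R=1 \mid Z, X, D)}{\Pr(X) \times \Pr(D \mid X) \times \Pr(R=1 \mid X, D)} \\
&= \frac{\Pr(X) \times \Pr(Z \mid X) \times \Pr(D \mid Z) \times \Pr(R=1 \mid D)}{\Pr(X) \times \Pr(D \mid X) \times \Pr(R=1 \mid D)} \\
&= \frac{\Pr(Z \mid X) \times \Pr(D \mid Z)}{\Pr(D \mid X)},
\end{aligned}
\tag{11}
$$

and therefore,

$$
\begin{aligned}
\mathrm{OR}_{XZ}^{\text{case}}
&= \frac{\Pr(Z=1 \mid X=1, D=1, R=1) \big/ \Pr(Z=0 \mid X=1, D=1, R=1)}{\Pr(Z=1 \mid X=0, D=1, R=1) \big/ \Pr(Z=0 \mid X=0, D=1, R=1)} \\
&= \frac{\left[\dfrac{\Pr(Z=1 \mid X=1) \times \Pr(D=1 \mid Z=1)}{\Pr(D=1 \mid X=1)}\right] \Big/ \left[\dfrac{\Pr(Z=0 \mid X=1) \times \Pr(D=1 \mid Z=0)}{\Pr(D=1 \mid X=1)}\right]}
        {\left[\dfrac{\Pr(Z=1 \mid X=0) \times \Pr(D=1 \mid Z=1)}{\Pr(D=1 \mid X=0)}\right] \Big/ \left[\dfrac{\Pr(Z=0 \mid X=0) \times \Pr(D=1 \mid Z=0)}{\Pr(D=1 \mid X=0)}\right]} \\
&= \frac{\Pr(Z=1 \mid X=1) \big/ \Pr(Z=0 \mid X=1)}{\Pr(Z=1 \mid X=0) \big/ \Pr(Z=0 \mid X=0)} \\
&= \mathrm{OR}_{XZ}^{\text{population}} \\
&= \frac{\left[\dfrac{\Pr(Z=1 \mid X=1) \times \Pr(D=0 \mid Z=1)}{\Pr(D=0 \mid X=1)}\right] \Big/ \left[\dfrac{\Pr(Z=0 \mid X=1) \times \Pr(D=0 \mid Z=0)}{\Pr(D=0 \mid X=1)}\right]}
        {\left[\dfrac{\Pr(Z=1 \mid X=0) \times \Pr(D=0 \mid Z=1)}{\Pr(D=0 \mid X=0)}\right] \Big/ \left[\dfrac{\Pr(Z=0 \mid X=0) \times \Pr(D=0 \mid Z=0)}{\Pr(D=0 \mid X=0)}\right]} \\
&= \frac{\Pr(Z=1 \mid X=1, D=0, R=1) \big/ \Pr(Z=0 \mid X=1, D=0, R=1)}{\Pr(Z=1 \mid X=0, D=0, R=1) \big/ \Pr(Z=0 \mid X=0, D=0, R=1)} \\
&= \mathrm{OR}_{XZ}^{\text{control}}.
\end{aligned}
\tag{12}
$$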
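
The equality of the three odds ratios under the sharp null can also be checked numerically. The sketch below uses made-up probabilities (the joint distribution of X and Z, the disease risks and the recruitment fractions are all hypothetical) in which the disease risk depends on Z only and recruitment depends on D only, and then computes the odds ratio of X and Z in the population, among recruited cases and among recruited controls.

```python
from itertools import product

# Hypothetical population quantities (illustrative values only).
pr_xz = {(1, 1): 0.20, (1, 0): 0.20, (0, 1): 0.15, (0, 0): 0.45}  # Pr(X = x, Z = z)
pr_d_given_z = {1: 0.10, 0: 0.02}   # sharp null: Pr(D = 1 | X, Z) = Pr(D = 1 | Z)
pr_r_given_d = {1: 0.50, 0: 0.01}   # recruitment probability depends on D only

def odds_ratio(p):
    """Odds ratio of X and Z from (possibly unnormalised) cell probabilities p[(x, z)]."""
    return p[(1, 1)] * p[(0, 0)] / (p[(1, 0)] * p[(0, 1)])

def recruited_xz(status):
    """Pr(X = x, Z = z, D = status, R = 1) for every (x, z)."""
    out = {}
    for x, z in product((0, 1), repeat=2):
        pr_d = pr_d_given_z[z] if status == 1 else 1.0 - pr_d_given_z[z]
        out[(x, z)] = pr_xz[(x, z)] * pr_d * pr_r_given_d[status]
    return out

or_population = odds_ratio(pr_xz)
or_case = odds_ratio(recruited_xz(1))
or_control = odds_ratio(recruited_xz(0))
```

With these inputs all three odds ratios coincide, because the factors involving D and R cancel between the numerator and the denominator of each odds ratio.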



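Testing crude null: n-based test. In a case-control study conducted in the study population, testing the crude
null amounts to testing the equality of prevalence odds of X between the case group
($\mathrm{Odds}_X^{\text{case}}$) and the control group ($\mathrm{Odds}_X^{\text{control}}$), or equivalently,
testing whether the odds ratio of X and D equals one:

$$\mathrm{OR}_{XD}^{\text{case-control}} = \mathrm{Odds}_X^{\text{case}} \big/ \mathrm{Odds}_X^{\text{control}} = 1. \tag{13}$$

Supplementary Table S1 presents the cell counts of a case-control study (ignore the variable, Z, for now). One
may use the following test statistic:

$$
\chi^2_{\text{crude}}
= \frac{\left(\log \widehat{\mathrm{Odds}}_X^{\text{case}} - \log \widehat{\mathrm{Odds}}_X^{\text{control}}\right)^2}
       {\widehat{\mathrm{Var}}\left(\log \widehat{\mathrm{Odds}}_X^{\text{case}}\right) + \widehat{\mathrm{Var}}\left(\log \widehat{\mathrm{Odds}}_X^{\text{control}}\right)}
= \frac{\left(\log \dfrac{n_{1+}^{\text{case}}}{n_{0+}^{\text{case}}} - \log \dfrac{n_{1+}^{\text{control}}}{n_{0+}^{\text{control}}}\right)^2}
       {\sum_{j \in \{0,1\}} \dfrac{1}{n_{j+}^{\text{case}}} + \sum_{j \in \{0,1\}} \dfrac{1}{n_{j+}^{\text{control}}}}.
\tag{14}
$$

Here $n_{j+}$ is the cell count for X = j summed over the levels of Z. $\chi^2_{\text{crude}}$ is distributed
asymptotically as a chi-squared distribution with one degree of freedom (df) under the crude null.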
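
The crude statistic can be sketched in a few lines. The function names and the counts below are hypothetical; the statistic is taken as the squared difference of the two log odds of X divided by the sum of reciprocal marginal counts, and the df = 1 P-value uses the identity Pr[chi-squared > s] = erfc(sqrt(s / 2)).

```python
import math

def chi2_crude(n_case, n_control):
    """Crude statistic; n_case[j] and n_control[j] are the numbers of subjects with X = j."""
    diff = math.log(n_case[1] / n_case[0]) - math.log(n_control[1] / n_control[0])
    var = sum(1.0 / c for c in list(n_case) + list(n_control))
    return diff ** 2 / var

def p_value_df1(stat):
    """Upper-tail probability of a chi-squared variable with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical marginal counts: index 0 for X = 0, index 1 for X = 1.
cases = [60, 40]
controls = [70, 30]
stat = chi2_crude(cases, controls)
p_val = p_value_df1(stat)
```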



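Power comparison. The power of the traditional n-based $\chi^2_{\text{crude}}$ is:

$$\text{Power of } \chi^2_{\text{crude}} = \Pr\left[\chi^2_{df=1}(\lambda) > \chi^2_{df=1,\,1-\alpha}\right], \tag{15}$$

where $\chi^2_{df=1}(\lambda)$ is a df = 1 noncentral chi-squared distribution with noncentrality parameter,

$$
\lambda = \frac{\left(\log \mathrm{Odds}_X^{\text{case}} - \log \mathrm{Odds}_X^{\text{control}}\right)^2}
               {\sum_{j \in \{0,1\}} \dfrac{1}{E\left(n_{j+}^{\text{case}}\right)} + \sum_{j \in \{0,1\}} \dfrac{1}{E\left(n_{j+}^{\text{control}}\right)}}.
\tag{16}
$$

Note that the power of $\chi^2_{\text{crude}}$ is determined by the significance level, $\alpha$; the sample size, n
(or more exactly the expected cell counts); and the effect size:

$$\log \mathrm{Odds}_X^{\text{case}} - \log \mathrm{Odds}_X^{\text{control}}. \tag{17}$$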


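Assuming that a panel of independent auxiliary variables contains a certain proportion, $\pi$
($0 < \pi \le 1$), of perturbative Zs such that
$\log\left(\mathrm{OR}_{XZ}^{\text{case}} / \mathrm{OR}_{XZ}^{\text{control}}\right)$ follows a normal distribution
with a mean of zero and a variance of $\sigma^2 > 0$, the theoretical power of the p-based MPT based on such a
panel is:

$$
\text{Power of MPT} \approx \Pr\left[\chi^2_{df=p} > \frac{\varepsilon^2}{I + \varepsilon^2}\, \chi^2_{df=p,\,1-\alpha}\right],
\tag{18}
$$

where

$$
\varepsilon^2 = \sum_{j,k \in \{0,1\}} \frac{1}{E\left(n_{jk}^{\text{case}}\right)} + \sum_{j,k \in \{0,1\}} \frac{1}{E\left(n_{jk}^{\text{control}}\right)}.
\tag{19}
$$

Note that in addition to $\alpha$ and n, the power of MPT is also determined by the total number of auxiliary
variables, p, and the 'informativeness' of the auxiliary variables:

$$I = \pi \times \sigma^2 \tag{20}$$

(the product of perturbation proportion and perturbation strength).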
We consider an X that is very weakly associated with D:

$$\mathrm{OR}_{XD}^{\text{case-control}} = \mathrm{Odds}_X^{\text{case}} \big/ \mathrm{Odds}_X^{\text{control}} = 1.1. \tag{21}$$

We also consider a panel of independent Zs. The logarithm of $\mathrm{OR}_{XZ}^{\text{population}}$ follows a
normal distribution with a mean of zero and a variance of 0.5 (a probability of 95% that an
$\mathrm{OR}_{XZ}^{\text{population}}$ is between 0.25 and 4.00). We consider four different values for the
perturbation proportion ($\pi$ = 1.0, 0.2, 0.1 and 0.05, respectively), with each perturbative Z having a weak
perturbation effect ($\sigma^2$ = 0.001, i.e., a probability of 95% that the ratio
$\mathrm{OR}_{XZ}^{\text{case}} / \mathrm{OR}_{XZ}^{\text{control}}$ is between 0.94 and 1.06). The informativeness
of the Zs is therefore 0.001, 0.0002, 0.0001 and 0.00005, respectively. For convenience, the prevalence of X and
of each and every one of the Zs is set at 40% for the control group. The significance level is set at
$\alpha$ = 0.05.
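
The way the power of MPT grows with p can be explored with a short calculation. The sketch below rests on several assumptions that are not taken from the text: it assumes the power expression has the form Pr[chi2_p > eps2 / (I + eps2) x chi2_(p, 1 - alpha)], where eps2 is the sum of reciprocal expected cell counts and I is the informativeness; it replaces exact chi-squared quantiles and tail probabilities with the Wilson-Hilferty normal approximation so that only the standard library is needed; and it uses hypothetical expected cell counts for 500 cases and 500 controls with X and Z independent at 40% prevalence.

```python
import math
from statistics import NormalDist

STD = NormalDist()

def chi2_quantile(df, prob):
    """Wilson-Hilferty approximation to the chi-squared quantile."""
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + STD.inv_cdf(prob) * math.sqrt(c)) ** 3

def chi2_upper_tail(df, x):
    """Wilson-Hilferty approximation to Pr[chi2_df > x]."""
    c = 2.0 / (9.0 * df)
    return 1.0 - STD.cdf(((x / df) ** (1.0 / 3.0) - (1.0 - c)) / math.sqrt(c))

def epsilon2(expected_case, expected_control):
    """Sum of reciprocal expected cell counts over the case and control tables."""
    return sum(1.0 / e for e in expected_case + expected_control)

def power_mpt(p, informativeness, eps2, alpha=0.05):
    """Approximate power of MPT with p auxiliary variables (assumed form, see lead-in)."""
    critical = chi2_quantile(p, 1.0 - alpha)
    return chi2_upper_tail(p, critical * eps2 / (informativeness + eps2))

# Hypothetical expected counts for the four (X, Z) cells in each group of 500 subjects.
cells = [80.0, 120.0, 120.0, 180.0]
eps2 = epsilon2(cells, cells)
powers = [power_mpt(p, 0.001, eps2) for p in (1_000, 10_000, 100_000)]
```

With an informativeness of 0.001 the approximate power rises steadily with p under these assumptions, and it falls back to the significance level when the informativeness is set to zero.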

Calculation of P-value using permutation. If the Zs in the panel are independent of one another, MPT
asymptotically follows a df = p chi-squared distribution under the sharp null. The critical value of MPT
therefore is simply $\chi^2_{df=p,\,1-\alpha}$ when the level of significance is set at $\alpha$. In actual
practice, however, Zs may not be independent of one another and the sample size may be too small for an adequate
chi-square approximation. Therefore, we need to rely on computer-intensive methods to simulate the null
sampling distribution of MPT. With p = 1, Bůžková et al. pointed out that the method of parametric bootstrap is
valid but the method of permutation (shuffling disease status between subjects) is conservative (overestimating
the critical value) [26]. However, we found that as p increases, the permutation method remains slightly
conservative but the parametric method becomes too liberal (underestimating the critical value).
To err on the safe side, we therefore propose to use the permutation method to
approximate the null sampling distribution of MPT.
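
The permutation procedure can be sketched as follows: shuffle the disease status between subjects, recompute MPT, and report the proportion of shuffles whose statistic is at least as large as the observed one. The function names, the seed handling and the toy data are hypothetical; the P-value is computed as a plain proportion (so it can be exactly zero), whereas adding one to both the numerator and the denominator is a common alternative.

```python
import math
import random

def chi2_sharp(x, z, d):
    """Perturbation statistic of one auxiliary variable z for factor x and disease d."""
    n = {(s, j, k): 0 for s in (0, 1) for j in (0, 1) for k in (0, 1)}
    for xi, zi, di in zip(x, z, d):
        n[(di, xi, zi)] += 1
    if min(n.values()) == 0:
        return 0.0                    # empty cell: treated as uninformative
    def log_or(s):
        return math.log(n[(s, 1, 1)] * n[(s, 0, 0)] / (n[(s, 1, 0)] * n[(s, 0, 1)]))
    return (log_or(1) - log_or(0)) ** 2 / sum(1.0 / c for c in n.values())

def mpt(x, panel, d):
    return sum(chi2_sharp(x, z, d) for z in panel)

def permutation_p_value(x, panel, d, n_perm=1000, seed=0):
    """Proportion of disease-status shuffles with MPT at least as large as observed."""
    rng = random.Random(seed)
    observed = mpt(x, panel, d)
    shuffled = list(d)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)         # breaks any link between D and (X, Zs)
        if mpt(x, panel, shuffled) >= observed:
            exceed += 1
    return exceed / n_perm

# Toy data: 100 subjects, 20 auxiliary variables, no true association.
rng = random.Random(1)
n_subjects, p = 100, 20
x = [int(rng.random() < 0.4) for _ in range(n_subjects)]
d = [int(rng.random() < 0.5) for _ in range(n_subjects)]
panel = [[int(rng.random() < 0.4) for _ in range(n_subjects)] for _ in range(p)]
p_value = permutation_p_value(x, panel, d, n_perm=200)
```

Shuffling only the disease status keeps the joint distribution of X and the Zs intact, so the dependence among the Zs is preserved in every permuted data set.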

Monte-Carlo simulation. We perform Monte-Carlo simulation to study the statistical properties of MPT
empirically. The parameter setting is the same as in the previous section. The sample size is set at n = 1,000.
But to avoid the heavy computational burden of simulating a very large panel of Zs, this time we let the Zs have
a perturbation proportion of 1.0 and a larger perturbation strength ($\sigma^2$ = 0.004, a probability of 95%
that $\mathrm{OR}_{XZ}^{\text{case}} / \mathrm{OR}_{XZ}^{\text{control}}$ is between 0.88 and 1.13). Additionally,
we also consider dependent Zs. Specifically, we simulate Zs using a first-order Markov chain, in both the case
and the control groups, assuming an odds ratio between successive Zs of 2.0 (mild dependency) and 5.0 (strong
dependency), respectively. We perform a total of 1,000 simulations. In each round of the simulation, we conduct
1,000 permutations to obtain an empirical P-value for MPT. The power of MPT is
then calculated as the proportion of the simulations with a P-value < 0.05.

The type I error rates of MPT for panels of independent and dependent Zs (odds ratio between successive Zs =
5.0) are also empirically checked using Monte-Carlo simulations, for different numbers of subjects (n = 500,
1,000, 5,000) and numbers of auxiliary variables (p = 100, 1,000, 5,000). (Both n and p are assumed to be fixed
by design.) Here X is a sharp null, that is, X has no effect on disease in any level stratified by the Zs (no
perturbation effect for all Zs: $I = \pi \times \sigma^2 = 0$). Other parameters are the same as in the power
simulations. We perform a total of 1,000 simulations, each round
with 1,000 permutations.
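
One way to generate dependent binary auxiliary variables is a first-order Markov chain whose transition probabilities are chosen so that successive variables have a given odds ratio. The sketch below is one possible parameterisation, not necessarily the one used in the simulations: it assumes that every Z has the same marginal prevalence (40% here) and solves the resulting quadratic for the joint probability Pr(Z_i = 1, Z_{i+1} = 1). All function names are hypothetical.

```python
import math
import random

def joint_p11(prev, odds_ratio):
    """Pr(both = 1) for two binary variables with equal prevalence and a given odds ratio."""
    if odds_ratio == 1.0:
        return prev * prev
    s = 1.0 + 2.0 * prev * (odds_ratio - 1.0)
    disc = s * s - 4.0 * odds_ratio * (odds_ratio - 1.0) * prev * prev
    return (s - math.sqrt(disc)) / (2.0 * (odds_ratio - 1.0))

def transition_probs(prev, odds_ratio):
    """Return Pr(next = 1 | current = 0) and Pr(next = 1 | current = 1)."""
    p11 = joint_p11(prev, odds_ratio)
    return (prev - p11) / (1.0 - prev), p11 / prev

def simulate_chain(p, prev, odds_ratio, rng):
    """Simulate Z_1, ..., Z_p for one subject as a stationary first-order Markov chain."""
    to1_from0, to1_from1 = transition_probs(prev, odds_ratio)
    z = [1 if rng.random() < prev else 0]
    for _ in range(p - 1):
        prob = to1_from1 if z[-1] == 1 else to1_from0
        z.append(1 if rng.random() < prob else 0)
    return z

rng = random.Random(0)
chain = simulate_chain(50, 0.4, 5.0, rng)   # strong dependency between neighbours
```

For a prevalence of 40%, an odds ratio of 2.0 corresponds to a joint probability of 0.20 and an odds ratio of 5.0 to 0.25; the chain leaves the marginal prevalence unchanged from one variable to the next.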

Application to real data. MPT is applied to a public-domain data set from a genome-wide association study of
age-related macular degeneration [6]. The study recruited 146 individuals (96 cases and 50 controls) and
genotyped 116,212 single nucleotide polymorphisms (SNPs). A total of 6,639 SNPs located on chromosome 1 (where
previous studies [27,28] have identified a number of significant susceptibility genes) with call rate > 95%,
minor allele frequency > 5% and in Hardy-Weinberg equilibrium in the control group are included in the analysis.
At each SNP, heterozygote and variant homozygote are grouped together.

In the analysis, each SNP takes its turn to be the X, and the remaining SNPs, the Zs. (The number of auxiliary
variables is p = 6638 for each and every one of the total 6639 SNPs. This number is set prior to the MPT
analysis to avoid complicating the multiple testing problem.) For a low-frequency SNP, some of the cells in
Supplementary Table S1 may be empty. In that case, it is totally uninformative as a perturbation variable,
because its $\chi^2_{\text{sharp}}$ statistic is zero with the convention: 0 × log 0 = 0. The P-value of the MPT
for each SNP is obtained from 500,000 rounds of permutation. Because we repeatedly test each and every one of
the 6639 SNPs for significance, for multiple testing correction the false discovery rate (FDR) is controlled at
0.05 using the q-values (QVALUE software) [13]. (Because of the dependence [...] nominal 0.05 [13,29,30].) Note
that our fixation/drifting analysis does not create a multiple testing problem by itself, because the procedure
was done only after the significance of a SNP had been determined.